This data set consists of a number of results from objective tests (e.g. pH,alcohol,density) on White wines. It also includes a quality rating, which is based on sensory data (median of at least 3 evaluations made by wine experts) between 0 (very bad) and 10 (very excellent).
The citation of the data set is as below
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at:
[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
First, performing some preliminary exploration of the data set to understand the structure and view the summaries of different variables
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
In the data set, the mean quality is 5.878, with a median quality of 6. Density and pH appear to be less dispersed than other variables, with their mean & median rather close.
To better visualize the distribution of quality, a bar chart is plotted below, with the color also reflecting the quality for better visualization. Also, creating a new variable (quality.f), which comes from factoring the original quality variable. This transformation may be useful in the future as it is now a discrete variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Matching with our expectations, 6 has the highest count of wines.
Further analyzing the variables, it can be seen that both total and free sulfur dioxide are present in the data set. However, it seems that “bound” sulfur dioxide (total-free) is missing. Hence, this additional variable is created below, such that
bound.sulfur.dioxide = total.sulfur.dioxide-free.sulfur.dioxide
The summary of this new variable is as below
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
Now, to visually explore each of the other variables, they were plotted using grid arrange
Most of the plots seem to roughly follow a normal distribution, with the exception of residual sugar. Perhaps transforming it with the log function will result in a better trans formation.
Trying out the log transform on the residual sugar variable below to address the long tail
The residual sugar histogram now shows a bi-modal pattern. This data might be useful in further analysis in the following sections
Checking below to see how many wines have total SO2 > 50, which is the threshold value when it starts becoming noticeable in the taste
## [1] 0.9899959
Most of the wines (99%) in this database have a high SO2 content, which should be noticeable in their taste.Lets see in the further sections if this is valid
For most wines, pH should between 3 and 4. Lets see below how many wines fall outside this expected value range of pH
## [1] 0.08922009
Less than .01% lie outside the 3-4 range, with the most clustered between 2.9 and 3.
However, there seem to be some outliers in the data. Outliers can reduce the robustness of data analysis and can induce bias.Representing outliers with the graphs below
Although outliers can be clearly seen, they are not deleted from the data set as they could potentially hold some insights into the data.
Creating a final iteration of the grid of the variables, not plotting the outliers (bottom & top 1%) for all variables
There are 11 quantitative features and 4,898 observations in the data set.
The final variable is quality which was converted to a factor, since it provides a rating from 0 to 10.
The actual quality varies from 3 to 9, with 6 as median and 5.88 as mean.
A useful final model should be able to predict the “quality” of the wine. Hence, that is the key feature of interest.
Many of the other features likely have an impact on the final quality. For example alcohol,density, pH appear to be of interest
As the data only included free and total sulfur dioxide (sum of free & bound), I created a new variable - “bound sulfur dioxide” = Total - Free Sulfur Di oxide.
This variable may provide additional insights going forward in the analysis.
I made the following changes in the data set
Added a new variable, quality.f, which is a factor of the original quality variable, as the ratings can be better analyzed as discrete values.
The sugar and alcohol data did not appear to be normally distributed among all the other variables.Hence, I used the Log transformation on the residual sugar and alcohol values. Thus they became closer to a normal distribution
As a first step, the ggpairs function was used to get an overview of the various relationships existing between pairs of variables. The X and the quality.f variables were excluded to have a cleaner analysis.
Some of the variable names have been shortened to increase readability, e.g. b.SO2 = bound sulfur dioxide
## 'data.frame': 4898 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ quality.f : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ bound.sulfur.dioxide: num 125 118 67 139 139 67 106 125 118 101 ...
The highest correlations with quality (positive or negative) which came from the above result are listed below
Quality and alcohol: 0.436
Quality and density: -0.307
Quality and Bound Sulfur Dioxide: -0.218
Quality and chlorides: -0.210
Quality and Total Sulfur Di Oxide :-.175
None of the values point towards a strong correlation !
In order to further analyse these variables, I am plotting below a scatter plot with for alcohol vs Quality. Including jitter, alpha & removing 1% outliers on both ends for better visualization.
Additionally, adding a geom_smooth overlay on the data to get a sense of the variation
## [1] 0.4355747
The o.436 correlation can be seen, as higher % of alcohol is leading to a higher quality score within our given range.
However, there seems to be a slight dip in quality around 9-10 % alcohol. Since alcohol has the strongest correlation, analyzing this dip in detail in the plot below. The X & Y axis have been interchanged, and a stat_summary line plots the mean alcohol % for a given quality
It appears there is an inflection point at quality =5, where the line changes slope. I.e. this means that for lower quality wine (<5), the correlation is negative and for higher quality wine, the correlation is positive.
Trying to find these correlation factors below
## [1] -0.05862333 0.46756642
Hence, as it was suspected, the correlation between alcohol & quality is -0.0586 if quality is less than 5. If the quality is equal or more than 5, then the correlation is 0.467. This means that a simple linear model will not be very accurate. A conditional or non linear model may be more suitable.
Plotting similar comparison histograms for density, bound sulfur dioxide, chlorides and total sulfur dioxide below
The following were the key takeaways from the above charts
1.Density and quality have a weak negative correlation (-0.3), with higher densities associated with lower quality
2.Bound Sulfur Dioxide (-0.22)& Chlorides (-0.21) have very weak negative correlations, with quality generally falling as values of these variables increase
3.Total Sulfur Dioxide and quality has the weakest negative correlation (-0.175) among the pairs which we selected. It is weak enough to not be able to draw conclusive results
However, a point to note is that for both bound & total sulfur dioxide, there is a weak positive correlation which then turns negative at around 5 quality level.This could be because at that levels it starts to negatively impact taste
The other variable pairs which had a high correlation among them are as below
Total sulfur dioxide and bound sulfur dioxide: 0.922
Density and residual sugar: 0.839
Alcohol and density: -0.780
Total sulfur dioxide and free sulfur dioxide: 0.616
Bound sulfur dioxide and density: 0.504
Clearly, sulfur-dioxide levels (total vs bound or total vs free) will be strongly correlated to each other. Leaving those aside , we get Density VS Alcohol, Density VS sugar & Density VS Bound SO2
Alcohol and density have a strong negative correlation (-0.780). This may seem logical as alcohol is volatile liquid with a low density. Higher alcohol % would then tend to reduce the overall density of the wine
Sugar, has a bit the reverse effect of alcohol. It is strongly positively correlated (0.839) with density. This is probably as sugar itself is rather dense, its higher amounts increases the overall wine density
Bound SO2 has a largely positive correlation (0.504), though not as strong as sugar.
The feature of interest (quality) has weak correlation with alcohol (0.436), density (-.307),bound sulfur dioxide (-0.218), chlorides(-0.210) & Total Sulfur dioxide variables. (-0.175)
1.With Alcohol increasing from 3 to 5, the quality decreased. As the alcohol increased from 5 to 9, the quality increased.
2.Quality weakly decreases with an increase in either of the remaining four variables, i.e. density, bound sulfur dioxide,chlorides and total sulfur
However, a point to note is that for both bound & total sulfur dioxide, there is initially a weak positive correlation which then turns negative around 70-80 ppm. This could be because at that levels it starts to negatively impact taste
There were strong correlations between the following variables
Total sulfur dioxide and bound sulfur dioxide: 0.922
Density and residual sugar: 0.839
Alcohol and density: -0.780
Total sulfur dioxide and free sulfur dioxide: 0.616
Bound sulfur dioxide and density: 0.504
The relationships between Free, bound & total sulfur seem obvious. Besides those density was strongly positively linked with residual sugar (sugar has high density) and negatively linked with alcohol (alcohol has low density). Density was also weakly positively linked with sulfur dioxide
Excluding the correlation between total & bound sulfur (as they are related variables), the best correlation was between density and residual sugar.
The last section brought to light some interesting relationships between the various variable pairs. In this section, I will deep dive into some of those relationships, adding adding data from additional variables to look for further insights
Plotting below the alcohol vs residual sugar,where color depends on quality
We can see that residual sugar is higher at lower alcohol levels, and lower at higher alcohol levels. This also fits with the understood logic that the sugars are fermented to create alcohol in the process of wine making. Also, higher quality wines tend to have higher amounts of alcohol
Plotting below to check if a similar relationship exists between quality, density & residual sugar
The plot above reflects the findings of two of our earlier discussion points. Quality is generally falling as density increases, and that higher densities are associated with higher residual sugar levels
Plotting below to check if a similar relationship exists between quality, bound sulfur dioxide and residual sugar
This graph shows that elevated levels of bound SO2 tend to be positively correlated with higher amounts of residual sugar. This sounds logical as sugar is known to bind with sulfur to create bound SO2.Also, better quality is generally associated lower levels of bound sulfur dioxide
On the same topic, plotting the ratio of bound So2 / total SO2 with residual sugar
The graph shows some interesting results. At most low levels of quality, its is showing a negative correlation between residual sugar and proportion of bound So2. But at most high levels of quality, this correlation seems to become positive ! As they say, correlation does not imply causation, but this is still a point to note
Leaving aside quality, the other highly correlated features were alcohol, density and residual sugar. To better understand their distribution, I have distributed these variables below into “buckets”
Alcohol and density are negatively correlated, with high sugar contributing to higher densities
Faceting this graph by the quality levels yields the result below
High quality (quality >6) wines tend be more in the bottom right corner, with higher alcohol and lower densities. Quality 9 wines are in fact mostly low sugar, low density and high alcohol !
A linear model was attempted from this data, to predict wine qualities. Starting with the highest correlation variable (alcohol), each of the other variables were added one by one to select the best model
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = df)
## m2: lm(formula = quality ~ alcohol + density, data = df)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = df)
## m4: lm(formula = quality ~ alcohol + density + chlorides + pH, data = df)
## m5: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide,
## data = df)
## m6: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity,
## data = df)
## m7: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide +
## total.sulfur.dioxide, data = df)
## m8: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity +
## residual.sugar, data = df)
## m9: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide +
## total.sulfur.dioxide + citric.acid, data = df)
## m10: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity +
## residual.sugar + sulphates, data = df)
## m11: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide +
## total.sulfur.dioxide + citric.acid + free.sulfur.dioxide,
## data = df)
## m12: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity +
## residual.sugar + sulphates + volatile.acidity, data = df)
##
## ================================================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** -21.150*** -21.956*** -28.574*** -30.947*** -21.620*** 163.481*** -20.523** 204.084*** -20.523** 156.891***
## (0.098) (6.165) (6.162) (6.160) (6.437) (6.378) (6.416) (18.118) (6.465) (18.788) (6.465) (18.407)
## alcohol 0.313*** 0.360*** 0.343*** 0.340*** 0.341*** 0.354*** 0.344*** 0.146*** 0.342*** 0.101*** 0.342*** 0.183***
## (0.009) (0.015) (0.015) (0.015) (0.015) (0.016) (0.015) (0.024) (0.015) (0.024) (0.015) (0.024)
## density 24.728*** 23.671*** 23.675*** 31.315*** 33.752*** 24.098*** -163.937*** 22.979*** -205.139*** 22.979*** -156.909***
## (6.079) (6.074) (6.067) (6.369) (6.348) (6.353) (18.375) (6.405) (19.055) (6.405) (18.673)
## chlorides -2.382*** -2.286*** -2.245*** -2.393*** -2.303*** -1.034 -2.378*** -0.886 -2.378*** -0.099
## (0.558) (0.558) (0.558) (0.557) (0.553) (0.562) (0.555) (0.559) (0.555) (0.544)
## pH 0.261*** 0.069 0.866*** 0.912*** 0.709***
## (0.076) (0.084) (0.108) (0.108) (0.105)
## bound.sulfur.dioxide -0.001*** -0.009*** -0.009*** -0.009***
## (0.000) (0.001) (0.001) (0.001)
## fixed.acidity -0.082*** 0.082*** 0.110*** 0.065**
## (0.016) (0.021) (0.021) (0.021)
## total.sulfur.dioxide 0.007*** 0.007*** 0.007***
## (0.001) (0.001) (0.001)
## residual.sugar 0.084*** 0.100*** 0.087***
## (0.007) (0.008) (0.007)
## citric.acid 0.130 0.130
## (0.095) (0.095)
## sulphates 0.783*** 0.661***
## (0.103) (0.100)
## volatile.acidity -1.942***
## (0.110)
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.195 0.197 0.198 0.202 0.213 0.223 0.213 0.232 0.213 0.278
## adj. R-squared 0.190 0.192 0.195 0.197 0.197 0.201 0.212 0.222 0.212 0.231 0.212 0.277
## sigma 0.797 0.796 0.795 0.794 0.793 0.792 0.786 0.781 0.786 0.777 0.786 0.753
## F 1146.395 583.290 396.315 300.857 301.971 247.495 265.003 233.554 221.190 210.783 221.190 235.493
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5822.011 -5816.089 -5814.298 -5802.348 -5767.618 -5737.632 -5766.675 -5708.822 -5766.675 -5556.370
## Deviance 3112.257 3101.773 3090.247 3082.784 3080.530 3065.535 3022.369 2985.588 3021.206 2950.671 3021.206 2772.590
## AIC 11684.782 11670.255 11654.021 11644.177 11640.596 11618.695 11549.237 11491.264 11549.350 11435.643 11549.350 11132.740
## BIC 11704.272 11696.241 11686.504 11683.157 11679.576 11664.171 11594.713 11543.236 11601.323 11494.113 11601.323 11197.706
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## ================================================================================================================================================================================================
The highest R squared value achieved was 0.278, which does not amount to a very robust model. Possible explanations could be that either the model is not linear (which some of our analysis points towards) or that there are other features which are affecting the final quality
In this section, I observed the following relationships
Quality vs alcohol & Residual Sugar: The positive correlation between quality and alcohol of the earlier analysis was added another dimension by residual sugar. Lower residual sugar tended to increase the alcohol content, and hence increase the positive impact on quality
Quality vs density & Residual Sugar: Higher residual sugars increased the density, which tended to lower the overall quality of the wine. Hence, residual sugar & density strengthened their negative impact on the overall quality.
Quality vs bound sulfur dioxide & Residual Sugar: Higher residual sugars tended increased the amount of bound sulfur dioxide, which increased the negative impact on the quality of the wine
One surprising interaction was the one between bound sulfur dioxide and residual sugar. It seems higher sugar tends to bind more Sulfur di oxide,negatively impacting the overall quality of the wine
The other interesting interaction was between alcohol, density and residual sugar. Higher sugar correlates with higher density, while higher sugar also correlates with lower alcohol.Given the individual correlations of each of the variables with quality, its makes sense that the best quality wines tends to have low residual sugar, low density & high alcohol.
A linear model was fitted into the data set which attempts to predict the quality of the wine based on the the other input features.
As the number of input features was increased, the model generally improved, except for certain variables, with the highest R squared value of 0.278 which includes all the variables
This value is not particularly high. As discussed before, this could be because the variable relationships are not linear or that there are some unknown variables also affecting quality. Also, some input variables are internally correlated to each other
This plot shows the important relationship between alcohol & wine quality. Although it has a positive correlation of 0.436, there seems to be a slight dip in quality around 9-10 % alcohol.
It appears there is an inflection point at quality =5, where the line changes slope. I.e. this means that for lower quality wine (<5), the correlation is negative and for higher quality wine, the correlation is positive. This also means that a simple linear model will not be very accurate in predicting this behavior. A conditional/non linear model may be more suitable.
This scatter plot shows the bound sulfur dioxide vs residual sugar & quality.High residual sugar tends to increase the amount of bound sulfur dioxide, as higher sugar tends to covert free Sulfur di oxide into bound sulfur dioxide.
High sugar and consequently higher bound SO2 negatively impacts the quality of the wine, which can be seen in the graph. Higher quality wines generally tend to have lower bound SO2, with none of the highest quality (9) wines having bound SO2> ~ 120
This scatter plots shows Alcohol % VS density, with the color associated with the level of residual sugar
Alcohol and density are negatively correlated. In addition, as sugar has itself a high density, higher amounts of sugar tend to increase the overall density of the wine.
This project has helped increase both my knowledge of exploratory data analysis using R and of white wines !
This database contained 4898 data points about white wines, with each data point associated with 11 characteristics such as residual sugar, density, alcohol etc along with an overall wine quality score
I started off with getting to understand the data and exploring the given variables through individual variable plots. I then plotted variables against each other and the overall wine quality to better understand the nature and amount of their impact.Finally, selecting those variables which had the maximum impact, I performed advance plotting, visualizing and understanding multiple variables in each plot to draw interesting insights from the data
Certain variables like alcohol had a positive correlation with quality, while certain variables like density had a negative correlation with quality. It was interesting that many of these variables were also affecting each other, e.g. how alcohol and density are negatively correlated. I also created a new variable “bound sulfur dioxide” which eventually uncovered an interesting relationship with residual sugar.
In the end, I created a linear model to try and predict the quality of the wine based on the wine characteristics. The overall robustness of this model did not come out to be very high. I tried to solve this by adding one variable at a time, to try and find a combination which fits best. This process somewhat improved the R squared value, but was still not high enough to “Pop the Champagne” ! Perhaps opening some regular wine would suffice !
This could mean that either the model is not linear or there are other unknown variables affecting the final quality.
It is important to remember that correlation does not imply causation ! Hence, drawing conclusions from correlation alone is not recommended although it can give a starting point from where further analysis can be undertaken. In this project, further steps could be taken to try to fit a non linear model or explore if there are any other unknown variables present.
Nonetheless, it was a learning experience where I picked up knowledge of both the concepts and the tools of Exploratory Data Analysis